ABSTRACT: Two non-integer parameters are defined for MAX statistics, which are maxima of $d$ simpler
test statistics. The first parameter, $d_{MAX}$, is the fractional number of tests, representing the equivalent
number of independent tests in MAX. If the $d$ tests are dependent, $d_{MAX} < d$. The second parameter is the
fractional degrees of freedom $k$ of the chi-square distribution $\chi^2_k$ that fits the MAX null distribution. These two
parameters, $d_{MAX}$ and $k$, can be independently defined, and $k$ can be non-integer even if $d_{MAX}$ is an integer.

We illustrate these two parameters using the example of MAX2 and MAX3 statistics in genetic case-control
studies. We speculate that $k$ is related to the amount of ambiguity of the model inferred by the test. In
case-control genetic association, tests with low $k$ (e.g. $k = 1$) are able to provide definitive information about
the disease model, whereas tests with high $k$ (e.g. $k = 2$) are completely uncertain about the disease
model. Similar to Heisenberg's uncertainty principle, the ability to infer the disease model and the ability to detect
significant association may not be simultaneously optimized, and $k$ seems to measure the level of their balance.

1 Introduction

Geometric objects with non-integer dimensions, such as coastlines, random walk trajectories,
and Koch snowflakes, are well known [1]. Besides self-similarity, an
important property of most fractals is their non-integer dimensionality. It is perhaps less
known that non-integer or fractional parameter values are also a valid concept in statistical
distributions. The best example is the fractional degrees of freedom (df). The $\chi^2$ (chi-square)
distribution concerns the sum of squares of standard normal (Gaussian) variables. If $X_1$ and $X_2$
are two independent normally distributed variables with zero mean and unit variance, $Y = X_1^2 + X_2^2$ is then
distributed as $\chi^2$ with two ($k = 2$) degrees of freedom, denoted by $\chi^2_{k=2}$. The analytic
expression of the probability density of $\chi^2_k$ is known: $0.5^{k/2}/\Gamma(k/2) \cdot x^{k/2-1} \exp(-x/2)$,
where $\Gamma$ is the Gamma function [2]. In this expression, there is no conceptual difficulty in
extending an integer value of $k$ to non-integers. However, since $k$ has a specific meaning in
the original definition of the chi-square distribution, i.e., the number of standard normal variables
to be summed, one may wonder whether non-integer degrees of freedom, though allowed, have
any applications.
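As a small illustration of the point that the density formula accepts any positive $k$, the following sketch evaluates it for integer and non-integer $k$ and checks numerically that it still integrates to one. The function names and the crude midpoint integration are illustrative choices, not part of the original text.

```python
import math

def chi2_pdf(x: float, k: float) -> float:
    """Chi-square density 0.5^(k/2) / Gamma(k/2) * x^(k/2-1) * exp(-x/2), for x > 0, k > 0."""
    return 0.5 ** (k / 2) / math.gamma(k / 2) * x ** (k / 2 - 1) * math.exp(-x / 2)

def integrate_pdf(k: float, upper: float = 60.0, steps: int = 100_000) -> float:
    """Midpoint-rule integral of the density on (0, upper); a rough numerical check only."""
    h = upper / steps
    return sum(chi2_pdf((i + 0.5) * h, k) for i in range(steps)) * h

if __name__ == "__main__":
    for k in (2, 2.5, 3):
        # Nothing in the formula requires k to be an integer.
        print(k, chi2_pdf(1.0, k), integrate_pdf(k))
```

The integral stays close to one for $k = 2.5$ just as for $k = 2$ or $k = 3$, which is all that is needed for $\chi^2_k$ with fractional $k$ to be a valid distribution.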

The chi-square distribution plays an essential role in genetic association analysis, whose goal
is to determine whether a genetic marker at a particular chromosomal location is associated
(correlated) with a human disease or with the presence/absence of a phenotype of interest [3] [4] [5].
The simplest genetic marker has two possible "values" (alleles), written as a and A. Because
half of the genetic material of a person is from the father (F), and the other half from the
mother (M), a marker configuration can be written as F|M. The two-allele marker has four
possible configurations: a|a, a|A, A|a, and A|A. If we cannot easily distinguish the parental origin of
an allele, as is the case with most technologies currently in use, A|a and a|A are grouped
into one configuration, and the resulting three configurations (after dropping the vertical bar),
aa, aA, AA, are called genotypes.

The most popular design for genetic association studies right now is the case-control design [6]
[7] [8] [9] [10]. In this design, a group of patients (cases) and a group of disease-free normal persons
(controls) are recruited, their DNA is extracted, and their genotypes throughout the
genome (e.g. 10^ - 10^ markers on 23 chromosomes) are determined ("genotyped"). For a
particular marker, the numbers of case (and control) samples with the aa, aA, AA genotypes
are counted. These six genotype counts are stored in a 2-by-3 contingency table, with rows for the two
disease statuses and columns for the three genotypes. Many null hypotheses can be tested, and a
significant violation of the null is used as evidence for genetic association between the marker
and the disease. Exploration of the protein-coding genes near the marker could provide further
insight into the mechanism of the disease.

Establishing the null hypothesis is not as easy as first thought. One obvious choice is
to assume that the three genotype frequencies are the same in the two (case and control)
groups. If we use Pearson's chi-square test (goodness-of-fit test), the null distribution of
the test statistic is $\chi^2_{k=2}$. The relation between the degrees of freedom $k$ and the size of the
contingency table is straightforward: $k$ is equal to the number of rows minus 1 multiplied by
the number of columns minus 1 [11]. On the other hand, if the allele A "dominates" allele a,
there is no difference between the aA and AA genotypes; after combining the aA and AA
columns, the original 2-by-3 table becomes a 2-by-2 table, and the test statistic follows the
$\chi^2_{k=1}$ null distribution. Similar collapses from a 2-by-3 to a 2-by-2 table can be carried out
in several other ways, corresponding to "recessive", "multiplicative", "over-dominant", etc.
disease models, each with a $\chi^2_{k=1}$ null distribution for the corresponding test statistic.
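The collapsing described above can be sketched in a few lines. The example below computes Pearson's chi-square statistic and the degrees of freedom (rows minus 1) times (columns minus 1) for a 2-by-3 table and for its dominant and recessive 2-by-2 collapses, following the collapsing rules listed in the Appendix. The genotype counts are hypothetical and the function names are illustrative.

```python
def pearson_chi2(table):
    """Pearson chi-square statistic, sum of (O - E)^2 / E, for an r-by-c table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

def degrees_of_freedom(table):
    """(number of rows - 1) * (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

def collapse_dominant(table):
    """Columns (aa, aA, AA) -> (aa, aA+AA)."""
    return [[r[0], r[1] + r[2]] for r in table]

def collapse_recessive(table):
    """Columns (aa, aA, AA) -> (aa+aA, AA)."""
    return [[r[0] + r[1], r[2]] for r in table]

# Hypothetical counts: rows = (case, control), columns = (aa, aA, AA).
counts = [[300, 500, 200], [360, 480, 160]]

if __name__ == "__main__":
    for name, t in (("genotype", counts),
                    ("dominant", collapse_dominant(counts)),
                    ("recessive", collapse_recessive(counts))):
        print(name, degrees_of_freedom(t), round(pearson_chi2(t), 4))
```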

If the disease model is known, i.e., if we know the disease risk given a genotype, one can
easily choose the null hypothesis and a test so that deviation from the null can be detected.
Unfortunately, for many complex human diseases, due to the involvement of multiple genes and
gene-environment interaction, the disease model for a specific risk gene is largely unknown. To
increase the chance of detecting the association signal under model uncertainty,
several tests, each testing a different null hypothesis, can be applied, and the best result
among them is used. We call this procedure the "MAX test". The MAX test that maximizes
the test statistics from two or three disease models is a compromise between using one simple
disease model and using no model. As a result, the null distribution of MAX test statistics
is neither $\chi^2_{k=1}$ nor $\chi^2_{k=2}$, but something in between. We will show that this indeed leads to a
fractional degrees of freedom $k$ for $\chi^2_k$, and $2 - k$ measures our knowledge about the disease
model. The determination of $k$ is complicated by another issue: sometimes the two or
three test statistics being maximized are not independent. That leads to another fractional
parameter: the number of independent tests $d_{MAX}$.

Although these two fractional parameters are not the same as the fractal dimension of
fractals, a common theme is the non-integer value. We explore the properties of these
two parameters in detail in this paper, organized as follows: Section 2 introduces statistical
tests and the MAX statistical test; Section 3 discusses the fractional number of tests for the MAX
test, from the perspective of family-wide p-values; Section 4 discusses the fractional degrees
of freedom of the chi-square distribution, from the perspective of fitting the null distribution of MAX
test statistics. In the discussion section, we address the issue of whether the fractional degrees
of freedom is connected to a fractal dimension in the parameter space.

2 MAX statistical test 

When two different statistical tests are carried out on the same dataset [12], the more
significant result of the two (i.e., the more extreme test statistic value) can be reported as
the overall test result. This is a MAX statistical test. Clearly, the MAX test statistic will always
be larger than (or at least equal to) each individual test statistic being maximized. Although the null
distribution of a MAX test statistic may not be expressible by a simple analytic formula, we
do expect the "center of gravity" of the distribution to be shifted to the right, with a larger
mean and a thicker tail area ("inflated type I error") compared to that of an individual test,
for the obvious reason that the maximization procedure increases the mean value.

Here we would like to define a MAX statistic for the case-control genetic association study.
A dataset of such a study consists of six numbers: the numbers of case samples with the aa, aA, AA
genotypes ($N_{10}$, $N_{11}$, $N_{12}$), and the numbers of control samples with these three genotypes
($N_{00}$, $N_{01}$, $N_{02}$) (see Appendix). The row index $i$ in $N_{ij}$ indicates the case (1) or control (0) status,
and the column index $j$ indicates the genotype with $j$ copies of the A allele.

We consider three different tests which are part of a test family called the Cochran-Armitage
trend (CAT) test [13] [14]. This family of tests is parameterized by a value $x$, and the null
hypothesis is the equality of the weighted genotype frequency $x P_{aA} + P_{AA}$ in the case and control
groups. When $x = 0$, we are testing $P_{AA,case} = P_{AA,control}$, which corresponds to the genetic
recessive model on the risk allele A. When $x = 1$, we are testing the equality of $P_{aA} + P_{AA}$
in the case and control groups, which corresponds to the genetic dominant model (whenever the risk
allele A is present in a genotype, the disease risk is the same regardless of the second allele).
When $x = 0.5$, we are testing the equality of the allele frequency, $P_{aA}/2 + P_{AA} = p_A$, in the
two groups.

The expression of the CAT test statistic is given in the Appendix. There are other reformulations
of this formula, such as one using the estimated allele frequency difference and the
Hardy-Weinberg disequilibrium coefficient difference [15], but the simplest calculation of $CAT(x = 0)$
or $CAT(x = 1)$ is to merge the aA counts ($j = 1$) with the aa counts ($j = 0$) or the AA counts
($j = 2$), respectively, then calculate Pearson's chi-square test statistic (see Appendix).

If the underlying disease model is dominant, multiplicative, or recessive, then $CAT(x = 1)$,
$CAT(x = 0.5)$, or $CAT(x = 0)$, respectively, tends to be the largest. Fig. 1 shows an example
using a dominant model. The histogram (from 100,000 replicates) for $CAT(x = 1)$
peaks at a higher value than those of the other two CATs; $CAT(x = 0.5)$ is distributed slightly
lower than $CAT(x = 1)$, whereas the distribution of $CAT(x = 0)$ lies far towards smaller
values.

A MAX statistic is useful when the disease model is unknown. One may consider
these MAX statistics for case-control genetic data:

$$
\begin{aligned}
MAX2 &= \max(CAT(x = 0),\ CAT(x = 1)) \\
MAX3 &= \max(CAT(x = 0),\ CAT(x = 0.5),\ CAT(x = 1))
\end{aligned}
\tag{1}
$$

MAX2 was discussed in [15] [16], and MAX3 was discussed in [17] [18] [19]. Both MAX2 and
MAX3 are "smart" samplings of the disease model space without an exhaustive search.
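A minimal sketch of Eq. (1) is given below. It assumes the common textbook form of the Cochran-Armitage trend statistic with genotype scores (0, x, 1); the paper's own expression is in the Appendix and may be written differently. The counts and function names are hypothetical.

```python
def cat_statistic(table, x):
    """Cochran-Armitage trend statistic with scores (0, x, 1) for genotypes (aa, aA, AA).

    table = [[N10, N11, N12], [N00, N01, N02]]: case row first, then control row.
    Assumed form: T^2 / Var(T), with T = sum_j s_j * (N0 * N1j - N1 * N0j).
    """
    case, ctrl = table
    n1, n0 = sum(case), sum(ctrl)
    n = n1 + n0
    scores = (0.0, x, 1.0)
    col = [c + d for c, d in zip(case, ctrl)]          # combined genotype counts
    t = sum(s * (n0 * c - n1 * d) for s, c, d in zip(scores, case, ctrl))
    var = n1 * n0 / n * (
        n * sum(s * s * m for s, m in zip(scores, col))
        - sum(s * m for s, m in zip(scores, col)) ** 2
    )
    return t * t / var

def max2(table):
    """MAX2: maximum over the recessive (x=0) and dominant (x=1) tests."""
    return max(cat_statistic(table, 0.0), cat_statistic(table, 1.0))

def max3(table):
    """MAX3: maximum over x = 0, 0.5 and 1."""
    return max(cat_statistic(table, x) for x in (0.0, 0.5, 1.0))

# Hypothetical counts: rows = (case, control), columns = (aa, aA, AA).
counts = [[300, 500, 200], [360, 480, 160]]

if __name__ == "__main__":
    print([round(cat_statistic(counts, x), 4) for x in (0.0, 0.5, 1.0)])
    print(round(max2(counts), 4), round(max3(counts), 4))
```

With scores restricted to 0 and 1 (x = 0 or x = 1), this form reduces to Pearson's chi-square statistic on the corresponding collapsed 2-by-2 table, which matches the shortcut described in the text.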

3 The fractional number of tests in calculating test-family-wide p-values in MAX statistics

When several tests are applied to the same dataset and these tests are independent, there
is a simple formula for calculating the test-family-wide p-value, which is also the p-value for
the MAX test. We can derive the tail probabilities under the null distribution (i.e., p-values),
$P_{MAX2}$ and $P_{MAX3}$, with the tail starting from the observed test statistic value $M$, by the
following procedure ($p_{\chi^2}$ is the tail area probability under the $\chi^2_{k=1}$ distribution):

$$
\begin{aligned}
P_{MAX2} &= P(MAX2 > M \mid null) \\
 &= 1 - P(MAX2 < M \mid null) \\
 &= 1 - P(CAT(x = 0) < M \text{ and } CAT(x = 1) < M \mid null) \\
 &\approx 1 - P(CAT(x = 0) < M \mid null) \times P(CAT(x = 1) < M \mid null) \\
 &= 1 - (1 - P(CAT(x = 0) > M \mid null)) \times (1 - P(CAT(x = 1) > M \mid null)) \\
 &= 1 - (1 - p_{\chi^2})^2 \\
P_{MAX3} &= P(MAX3 > M \mid null) \\
 &= 1 - P(MAX3 < M \mid null) \approx \cdots = 1 - (1 - p_{\chi^2})^3
\end{aligned}
\tag{2}
$$

The approximation can be replaced by an equal sign for MAX2 only if $CAT(x = 0)$ and
$CAT(x = 1)$ are independent, and for MAX3 only if $CAT(x = 0)$, $CAT(x = 1)$ and
$CAT(x = 0.5)$ are independent. The independence assumption is untrue for MAX3 [17], but close to
true for MAX2 [16]. The two approximate formulas in Eq. (2), also known as the Dunn-Sidak
formula [20] [21], can be written as $1 - (1 - p_{\chi^2})^d$ for $d$ tests being maximized in MAX.

If we force the approximation sign in Eq. (2) to be an equality, $d$ can be derived from $P_{MAX}$.
This value of $d$ (called $d_{MAX}$ here) represents the effective number of independent tests:

$$
d_{MAX} = \frac{\log(1 - P_{MAX})}{\log(1 - p_{\chi^2})} . \tag{3}
$$

Note that the tail area probabilities for both MAX and chi-square, $P_{MAX}$ and $p_{\chi^2}$, are
determined by the same $M$, the starting position of the tail area.
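The relation between the Dunn-Sidak formula and the effective number of tests can be sketched as a pair of inverse functions. The function names are illustrative; the second function simply solves $P_{MAX} = 1 - (1 - p_{\chi^2})^d$ for $d$.

```python
import math

def sidak_pvalue(p_single: float, d: float) -> float:
    """Family-wide p-value 1 - (1 - p)^d for d independent tests (d may be fractional)."""
    return 1.0 - (1.0 - p_single) ** d

def effective_number_of_tests(p_max: float, p_single: float) -> float:
    """d_MAX = log(1 - P_MAX) / log(1 - p_chi2), both tail areas starting at the same M."""
    return math.log1p(-p_max) / math.log1p(-p_single)

if __name__ == "__main__":
    # Round trip: a family-wide p-value built from d = 2.1 tests is mapped back to d = 2.1.
    p_family = sidak_pvalue(0.05, 2.1)
    print(p_family, effective_number_of_tests(p_family, 0.05))
```

In practice $P_{MAX}$ would be an empirical tail area of the MAX statistic and $p_{\chi^2}$ the $\chi^2_1$ tail area at the same threshold, so the resulting $d_{MAX}$ is generally not an integer.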

Besides multiple testing correction of the test-family-wide p-value on the same dataset, the
Dunn-Sidak formula can also be used with the same test on multiple datasets. In particular,
in whole-genome association or linkage studies, selecting the SNP with the best association
or linkage signal among ~ 10^ SNPs belongs to this application [22] [23], and the genome-wide
p-value is calculated in the same way. The severe correction of the p-value in this application is
in sharp contrast to the correction in Eq. (2), by a factor of only 2 or 3.

In order to estimate the effective number of tests for MAX2 and MAX3, we carried out the
following simulation. We generated 100,000 replicates, each replicate being a 2-by-3 genotype
count table for 1000 cases and 1000 controls. The allele frequency is randomly chosen, but the same
allele frequency is used to simulate both case and control genotypes. The genotype frequencies
are derived from the allele frequency under Hardy-Weinberg equilibrium. The empirical
distributions of MAX2, MAX3, $CAT(x = 0)$, $CAT(x = 1)$, $CAT(x = 0.5)$ can all be determined
with 100,000 realizations of test statistic values (so the minimum p-value cannot be smaller
than $1/100{,}000 = 10^{-5}$). Using several threshold values $M$ (controlling the type I error), $d_{MAX2}$ and
$d_{MAX3}$ can be calculated by Eq. (3). Two more runs were also carried out, with 3000 cases/3000
controls and 5000 cases/5000 controls.

Fig. 2 shows the empirical $d_{MAX2}$ and $d_{MAX3}$ as a function of $p_{\chi^2}$, the tail probability for the
$\chi^2_1$ distribution. It can be seen that although there is only a slight reduction of $d_{MAX2}$ from
the expected value of 2, $d_{MAX3}$ is much smaller than the expected value of 3. At $p_{\chi^2} = 0.05$,
the value of $d_{MAX3}$ is around 2.1, consistent with a similar result of $d_{MAX3} = 2.2$ in [M].

The empirical $d$ values calculated from Eq. (3) for $CAT(x = 1)$ and $CAT(x = 0)$ are also
shown in Fig. 2 as a check of the accuracy of the simulation. Indeed, the $d$ values do not deviate
from the expected value of 1 except at smaller $p_{\chi^2}$ values. For low $p_{\chi^2}$ values,
a smaller number of replicates are used in the determination of the empirical p-values, thus the
variance is large. Fig. 2 does show that the estimated $d_{MAX2}$ and $d_{MAX3}$ values are not consistent
among the three runs at (e.g.) $p_{\chi^2} < 0.01$, an indication of large run-to-run variation.

Another source of potential bias is that we keep the allele frequency a minimum distance
away from the boundary values 0 and 1 in order to avoid zero genotype counts. The ranges of
allele frequency for the 3 runs are (0.1, 0.9), (0.05, 0.95), and (0.02, 0.98), respectively. Only when
both the sample size and the number of replicates go to infinity, and with unconstrained allele
frequency, can one expect the simulation-based estimate of $d_{MAX}$ to be exact.

4 Chi-square distributions with fractional degrees of freedom that 
fit the null distribution of Max test statistics 

The second fractional parameter related to MAX concerns the fitting of the MAX
null distribution by a non-integer-$k$ $\chi^2_k$ distribution. As mentioned in Section 1, the non-integer-$k$
chi-square distribution $\chi^2_k$ can be determined easily and is indeed implemented in statistical
packages, such as R (http://www.r-project.org/). Here we would like to check which $k$ value
in $\chi^2_k$ leads to a better fit to the MAX null distribution.

In order to avoid confusion between $k$ and $d_{MAX}$, we made the component tests independent
so that $d_{MAX}$ remains an integer. Instead of generating case and control samples with
specific genotypes and then calculating MAX2, MAX3, $CAT(x = 1)$ and $CAT(x = 0)$, we randomly
sample two or three independent chi-square values from the $\chi^2_{k=1}$ distribution, then carry out
the maximization procedure. Due to the independence between the chi-square values,
$d_{MAX2}$ and $d_{MAX3}$ should be exactly equal to 2 and 3. Here we use a different notation, Max2
and Max3, to represent this correlation-free simulation (to be compared with Eq. (1)):

$$
\begin{aligned}
Max2 &= \max(\chi^2_1,\ \chi^2_1) \\
Max3 &= \max(\chi^2_1,\ \chi^2_1,\ \chi^2_1)
\end{aligned}
\tag{4}
$$
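The correlation-free simulation of Eq. (4) can be sketched as follows, drawing each $\chi^2_1$ value as the square of a standard normal. The number of draws and the seed are arbitrary illustrative choices; the point of the sketch is that the empirical tail areas of Max2 and Max3 agree with $1 - (1 - p_{\chi^2})^d$ for $d = 2$ and $d = 3$.

```python
import random

def simulate_max(d: int, n: int, rng: random.Random) -> list:
    """n draws of the maximum of d independent chi-square(1) values (squared N(0,1) draws)."""
    return [max(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)) for _ in range(n)]

def tail_fraction(values, m: float) -> float:
    """Empirical tail area: fraction of values larger than the threshold m."""
    return sum(v > m for v in values) / len(values)

rng = random.Random(2024)            # fixed seed, only for reproducibility of the sketch
max2_draws = simulate_max(2, 100_000, rng)
max3_draws = simulate_max(3, 100_000, rng)

p1 = 0.05                            # chi-square(1) tail area beyond 3.841459
if __name__ == "__main__":
    # Empirical tail area next to the value 1 - (1 - p)^d for independent tests.
    print(tail_fraction(max2_draws, 3.841459), 1 - (1 - p1) ** 2)
    print(tail_fraction(max3_draws, 3.841459), 1 - (1 - p1) ** 3)
```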

Fig. 3 shows the result of fitting the empirical Max2 and Max3 by chi-square distributions
with non-integer degrees of freedom. Fig. 3(A,B) are quantile-quantile (QQ) plots, where
the x-axis is the ranked Max2 or Max3 value and the y-axis is the ranked chi-square value with
a fractional degrees of freedom ($k$ = 1.3, 1.4, 1.45, 1.5, 1.55, 1.6 for Max2, and $k$ = 1.5, 1.6, 1.7,
1.8, 1.9, 2 for Max3). To reduce variation, the average of 100 runs is used in Fig. 3. When two
distributions are identical, their QQ-plot should trace the diagonal line with slope = 1 (marked
by circles). In Fig. 3(A,B), chi-square distributions with a range of fractional degrees of freedom
seem to fit the Max2 and Max3 distributions well.

To examine more carefully how well fractional-$k$ chi-square distributions fit the Max2/Max3
distributions, we draw the detrended QQ-plots in Fig. 3(C,D), i.e., the y-axis is the difference
between the sorted chi-square values with fractional $k$ and the sorted Max2 or Max3 values.
Fig. 3(C,D) show systematic deviations between the two distributions. In other words, no
chi-square distribution with one single fractional $k$ value can fit Max2 and Max3 over the entire
range of values. For example, at Max2 $\approx 5$, $\chi^2_{1.5} -$ Max2 $\approx 0$ (good fit), whereas when Max2
$\gg 5$, $\chi^2_{1.5} -$ Max2 $> 0$ (bad fit).

It is straightforward to determine which $\chi^2_k$ crosses the zero horizontal line at what position
in Fig. 3(C,D). First, from Eq. (2), we see a simple relationship between the "head area" of $\chi^2_1$
and that of Max2/Max3:

$$
\begin{aligned}
(1 - P_{Max2})^{1/2} &= 1 - p_{\chi^2_1} \\
(1 - P_{Max3})^{1/3} &= 1 - p_{\chi^2_1}
\end{aligned}
\tag{5}
$$

The approximation in Eq. (2) becomes an equality because MAX2/MAX3 is replaced by Max2/Max3.
Here is an example of determining the zero crossing point in Fig. 3(C): if the tail area $P_{Max2}$
for Max2 is 0.05, the "head area" is 0.95, and the corresponding head area for $\chi^2_1$ is $\sqrt{0.95} =
0.9746794$. That head/tail area for $\chi^2_1$ can be used to determine the threshold value $M =
5.001825$, as marked in Fig. 4.
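The threshold in this example can be reproduced with the standard library alone, using the fact that a $\chi^2_1$ variable is the square of a standard normal, so its quantile is the square of a normal quantile. The function names are illustrative.

```python
from statistics import NormalDist

def chi2_1_quantile(head_area: float) -> float:
    """Quantile of chi-square(1): if Z ~ N(0,1), then P(Z^2 <= q) = 2 * Phi(sqrt(q)) - 1."""
    z = NormalDist().inv_cdf((1.0 + head_area) / 2.0)
    return z * z

def max_threshold(p_max: float, d: int) -> float:
    """Threshold M at which the maximum of d independent chi-square(1) values has tail area p_max."""
    return chi2_1_quantile((1.0 - p_max) ** (1.0 / d))

if __name__ == "__main__":
    print(0.95 ** 0.5)               # head area for chi-square(1); the text gives 0.9746794
    print(max_threshold(0.05, 2))    # the text gives M = 5.001825
```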

Then, we choose a $\chi^2_k$ with fractional $k$ so that its tail area determined by $M = 5.001825$ is
also 0.05. As shown in Fig. 4, the threshold value for a 0.05 tail area is 3.841459 for $\chi^2_1$ and
5.991465 for $\chi^2_2$. A fractional-$k$ $\chi^2_k$ ($1 < k < 2$) should have the threshold value for a 0.05 tail area
at $M = 5.001825$. The exact $k$ can be determined iteratively by a bisection method, resulting
in $k = 1.51$. In other words, at $P_{Max2} = 0.05$, $\chi^2_{k=1.51}$ is equivalent to the null distribution of
Max2.
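A sketch of this bisection is given below. It is not the authors' code: the chi-square distribution function is evaluated with a plain series for the regularized incomplete gamma function so that the example needs only the standard library, and the function names are illustrative.

```python
import math

def chi2_cdf(x: float, k: float) -> float:
    """Chi-square CDF via the series for the regularized lower incomplete gamma P(k/2, x/2)."""
    a, y = k / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while term > 1e-17 * total:      # terms eventually decrease geometrically
        n += 1
        term *= y / (a + n)
        total += term
    return total * math.exp(-y + a * math.log(y) - math.lgamma(a))

def equivalent_k(m: float, tail: float, lo: float = 1.0, hi: float = 2.0) -> float:
    """Bisection on k so that the chi-square(k) tail area beyond m equals `tail`."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if 1.0 - chi2_cdf(m, mid) < tail:   # tail area at fixed m grows with k
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print(equivalent_k(5.001825, 0.05))     # the text reports k = 1.51
```

The bracket [1, 2] works here because the tail area beyond $M$ is below 0.05 for $\chi^2_1$ and above 0.05 for $\chi^2_2$, as the two thresholds 3.841459 and 5.991465 show.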

Fig. 5(A) shows the above-mentioned fractional $k$ value vs. the tail area probability $P_{Max2}$
or $P_{Max3}$. We are mostly interested in small tail area values, e.g. $P_{Max2}, P_{Max3} < 0.05$, in a
test. In this range, the equivalent fractional $k$ is bounded from above, e.g. smaller than 1.5
(1.85) for Max2 (Max3). We also attempt to convert the curve in Fig. 5(A) to a straight line
by variable transformation. This can be accomplished by taking the cubic root of $P_{Max2}$ or
$P_{Max3}$: in Fig. 5(B), the fractional $k$ vs. $P_{Max2}^{1/3}$ or $P_{Max3}^{1/3}$ exhibits a reasonably good linear trend.

Besides fitting the tail area of Max2/Max3 by a fractional-$k$ $\chi^2_k$, one may also use a $\chi^2_k$
that has the same average/mean as Max2 or Max3. We know that the average/mean of the $\chi^2_k$
distribution is simply $k$, so this fractional $k$ is very easy to determine. For example, in
our simulation the means of Max2 and Max3 are 1.64 and 2.10 respectively. The corresponding
fractional-$k$ $\chi^2_k$'s that have the same means would be $\chi^2_{1.64}$ and $\chi^2_{2.10}$. Note that this fitting of
Max2/Max3 by $\chi^2_k$ is a fit of the mean, which receives contributions from head as well as tail
areas. It is not surprising that the resulting $k$'s are different from those that are based on tail
areas only. Since the tail area is of major concern in most statistical inferences, we regard the
definition of fractional df from the tail area as more useful.
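As a side check that is not part of the original argument, the simulated mean of Max2 can be compared with a closed form that follows from a standard Gaussian identity. Writing the two $\chi^2_1$ values as squares of independent standard normals $X_1$ and $X_2$:

```latex
\begin{aligned}
\max(X_1^2, X_2^2) &= \tfrac{1}{2}\,(X_1^2 + X_2^2) + \tfrac{1}{2}\,\lvert X_1^2 - X_2^2 \rvert, \\
X_1^2 - X_2^2 &= 2UV, \qquad U = \frac{X_1 - X_2}{\sqrt{2}},\quad V = \frac{X_1 + X_2}{\sqrt{2}}
  \quad \text{(independent standard normals)}, \\
E[\mathrm{Max2}] &= 1 + E\lvert U \rvert \, E\lvert V \rvert = 1 + \frac{2}{\pi} \approx 1.637 .
\end{aligned}
```

This agrees with the simulated mean of 1.64 quoted above.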

Figs. 3 and 5 all illustrate that a single fractional-$k$ $\chi^2_k$ cannot fit the Max2/Max3 distribution
perfectly. In particular, Fig. 3(C,D) shows that the deviation between the two is directional:
the matching $\chi^2_k$ has a fatter tail than Max2/Max3 beyond the crossing point. One method
to remove the systematic deviation in Fig. 3(C,D) is to use a linear function. For example,
Fig. 3(D) shows the result when the linear trend $-0.45 + 0.04\,\chi^2_{k=1.7}$ is removed from the detrended
QQ-plot of $\chi^2_{k=1.7}$ against Max3. This is equivalent to an approximation of sorted Max3 by
$0.45 + 0.96\,\mathrm{sort}(\chi^2_{k=1.7})$. Although it is neither a perfect approximation nor a unique one, the
trend removal does reduce the systematic deviation.

5 Case-control genetic data 

The result from the last section cannot be applied to the case-control data directly because
the individual test statistics in Eq. (1) are not independent, in particular for MAX3. There have
been attempts to derive the null distribution of MAX3 by considering the joint distribution of
$CAT(x = 0)$, $CAT(x = 0.5)$, and $CAT(x = 1)$ [24] [25]. From Fig. 3(B,D) and Fig. 5(B), it is
seen that we should not expect a single $\chi^2_k$ with a fractional $k$ to fit the MAX3 distribution
perfectly.

The questions we ask for real case-control data are: (1) what are the approximate values
of $k$ if a $\chi^2_k$ is forced to fit the tail area probability of MAX3? (2) how well does our approximate
distribution of Max3, $0.45 + 0.96\,\chi^2_{k=1.7}$, fit MAX3? To answer these questions, we
use the case-control data for type 2 diabetes provided in [26].

The tail-area probability of MAX3 can be obtained empirically by permutation: the affection
status labels of the samples are randomly shuffled, then the genotype counts are reconstructed.
From such a genotype count table, the MAX3 value can be determined. Repeated calculation
of MAX3 on label-shuffled datasets provides a null distribution, from which one can derive
the tail-area probability. The $P_{MAX3}$ thus determined for the top SNPs in [26] is reproduced in
Table 1. From the permutation-derived $P_{MAX3}$, we find the best-fit $\chi^2_k$ that leads to the same
$P_{MAX3}$ value, and that fractional degrees of freedom $k$ is listed in Table 1. The values
of $k$ range between 1.2 and 1.7, very similar to the range used in Fig. 3(D).
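The permutation procedure described above can be sketched as follows. The genotype data are hypothetical (not taken from [26]), the CAT statistic uses the common textbook form with scores (0, x, 1), and the number of permutations is kept small for illustration only.

```python
import random

def cat_statistic(table, x):
    """Cochran-Armitage trend statistic, scores (0, x, 1); table = [case row, control row]."""
    case, ctrl = table
    n1, n0 = sum(case), sum(ctrl)
    n = n1 + n0
    scores = (0.0, x, 1.0)
    col = [c + d for c, d in zip(case, ctrl)]
    t = sum(s * (n0 * c - n1 * d) for s, c, d in zip(scores, case, ctrl))
    var = n1 * n0 / n * (n * sum(s * s * m for s, m in zip(scores, col))
                         - sum(s * m for s, m in zip(scores, col)) ** 2)
    return t * t / var

def max3(table):
    return max(cat_statistic(table, x) for x in (0.0, 0.5, 1.0))

def count_table(genotypes, status):
    """Rebuild the 2-by-3 table: row 0 = cases (status 1), row 1 = controls (status 0)."""
    table = [[0, 0, 0], [0, 0, 0]]
    for g, s in zip(genotypes, status):
        table[0 if s == 1 else 1][g] += 1
    return table

def permutation_pvalue(genotypes, status, n_perm, rng):
    """Fraction of label-shuffled datasets whose MAX3 is at least the observed MAX3."""
    observed = max3(count_table(genotypes, status))
    labels = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)                      # shuffle affection status labels
        if max3(count_table(genotypes, labels)) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical data: genotype = number of A alleles; 200 cases followed by 200 controls.
genotypes = [0] * 60 + [1] * 100 + [2] * 40 + [0] * 90 + [1] * 90 + [2] * 20
status = [1] * 200 + [0] * 200
p_perm = permutation_pvalue(genotypes, status, 2000, random.Random(1))
print(p_perm)
```

With only 2000 permutations the smallest non-zero estimate is 1/2000, which is why tail areas as small as those in Table 1 require a very large number of permutations or an exact enumeration.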

gene/SNP              MAX3       P_MAX3 (permutation) (1)   k (2)    P_MAX3 (by k = 1.7 formula) (3)   P_MAX3 (exact) (4)
TCF7L2/rs7900150      34.18437   2.1 x 10^-8                1.676    1.36 x 10^-8                      1.29 x 10^-8
CAMTA1/rs1193179      25.71149   6.3 x 10^-7                1.213    1.17 x 10^-6                      1.00 x 10^-6
CXCR4/rs932206        23.28708   2.8 x 10^-6                1.336    4.19 x 10^-6                      3.67 x 10^-6
ZNF615/rs1978717      23.11983   4.9 x 10^-6                1.595    4.57 x 10^-6                      4.01 x 10^-6
HHEX/rs1111875        22.01918   8.6 x 10^-6                1.597    8.17 x 10^-6                      7.82 x 10^-6
LOC644419/rs282705    21.93485   9.0 x 10^-6                1.598    8.54 x 10^-6                      6.27 x 10^-6

Table 1: SNPs taken from Table S4 of the supplementary material of [26] with tail area probability (obtained
from permutation) smaller than 10^-5; if more than one SNP in a gene is significant at this level, only one
SNP is chosen here. The first two columns list the gene/SNP name and the MAX3 value (based on the genotype
counts given in the supplementary material of [26]). (1) values of tail area probability provided by [26]; (2)
the best fit of $k$ when the values in column "(1)" are used to fit a $\chi^2_k$ distribution; (3) estimation of the tail area
probability of MAX3 by the distribution (for Max3) of $0.45 + 0.96\,\chi^2_{k=1.7}$; (4) tail area probability of MAX3 by
the exact enumeration of all possible combinations.

Next, we estimate the tail area probability of MAX3 by the approximate formula discussed
in the last section (and Fig. 3(D)) for Max3. Due to the difference between Max3 and MAX3, and the
approximate nature of the formula, we do not expect the derived tail area probabilities to
be exact. Surprisingly, from the result in Table 1, this approximation actually leads to $P_{MAX3}$ values
that are similar to those obtained from permutation in [26].
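The approximate formula amounts to evaluating a $\chi^2_{k=1.7}$ tail area at a shifted and rescaled threshold, since $P(0.45 + 0.96\,\chi^2_{1.7} > M) = P(\chi^2_{1.7} > (M - 0.45)/0.96)$. The sketch below does this with a series-based chi-square distribution function written only for this illustration; the MAX3 values are those listed in Table 1.

```python
import math

def chi2_sf(x: float, k: float) -> float:
    """Chi-square tail area 1 - P(k/2, x/2), using the series for the lower incomplete gamma."""
    a, y = k / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while term > 1e-17 * total:
        n += 1
        term *= y / (a + n)
        total += term
    return 1.0 - total * math.exp(-y + a * math.log(y) - math.lgamma(a))

def approx_pmax3(m: float, k: float = 1.7, shift: float = 0.45, scale: float = 0.96) -> float:
    """Tail area of shift + scale * chi-square(k) beyond the observed MAX3 value m."""
    return chi2_sf((m - shift) / scale, k)

if __name__ == "__main__":
    for m in (34.18437, 25.71149, 23.11983):
        print(m, approx_pmax3(m))
    # For MAX3 = 25.71149, Table 1 lists 1.17 x 10^-6 in column (3).
```

Computing the tail as one minus the distribution function is adequate for tail areas of the size in Table 1, but it would lose precision for much smaller probabilities.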

Permutation only provides a sampling of the null distribution, and the finite number of
replicates could be a source of error. Mimicking Fisher's exact test, which determines the
tail area probability by counting the number of states in the tail area by combinatorics, we
can also determine the exact value of $P_{MAX3}$ (J. Tian, C. Xu, H. Zhang, Y. Yang, paper in
preparation). This exact tail area probability is listed in the last column of Table 1. Again,
we see that the approximation of $P_{MAX3}$ based on the fractional-$k$ chi-square distribution is not far
off from the exact values.

6 Discussion 

In this paper, we introduce two fractional parameter values for MAX test statistics: (1) the
fractional number of tests $d_{MAX}$ and (2) the fractional degrees of freedom $k$ for the chi-square
distribution that fits the MAX null distribution. The parameter $d_{MAX}$ has its counterparts in
other fields, such as the effective number of parameters for model selection [27] [28] [29] [30],
the effective number of genetic markers that are in linkage equilibrium [31] [32], the effective number of
grid points required to represent a climate field [33], the effective sample size in genetic studies of
relatives [34], etc. It was stated in [35] that between the two extreme situations of two tests
being independent and being identical, "an intermediate answer is to be anticipated". In one
particular situation, they actually have an example of 1.5 effective tests (page 340
of [35]).

There are two universal themes in these diverse studies: (1) Positive correlation causes the 
effective number to be smaller than the apparent number. This has several consequences, such 
as dimension reduction as a technique to simplify the dataset, correct ways for comparing 
statistical models by using the effective number of parameters to measure model complexity, 
etc. (2) As the effective number is determined from the real data, its value is most likely to 
be non-integer. Fractionality is the rule, not an exception. 

The Bonferroni correction of p-values for multiple testing is known to be conservative. The
very reason that it is conservative is that tests can be positively correlated, which is also
the cause of reduced values of the effective number of tests. Various attempts have been made to take
correlation among tests into account to make the correction less conservative [36] [37] [38]. Our
simulation results show that the reduction of the effective number of tests for MAX2 is very small,
indicating that $CAT(x = 0)$ and $CAT(x = 1)$ are not strongly correlated. However, there is a
large reduction in the effective number of tests for MAX3, and the multiplying factor of 3 is not
appropriate in a Bonferroni correction for MAX3.

Non-integer degrees of freedom $k$ for $\chi^2_k$ is our second fractional parameter, which has been
encountered occasionally in the statistical literature (e.g., [39]). The fact that df can be
non-integer is not surprising by itself, but it is more interesting to ask whether it
has any geometric interpretation. Our case-control association analysis example may provide
a hint, as there is a tangible link between the $k$ value and the size of the area in the disease model
space.

A disease model can be specified by 4 parameters (see Appendix), but a projection from
the 4-dimensional space to a 2-dimensional one is possible. Using a 2-by-3 genotype count table
as a realization of a disease model, Fig. 6 shows two different ways to map a 2-by-3 genotype
count table onto a two-dimensional plane. The first, as shown in Fig. 6(A), uses the case-control
difference of Hardy-Weinberg disequilibrium coefficients ($\delta_\epsilon$) and the case-control difference of allele
frequency ($\delta_p$) (see Appendix) [15] [16]. The second, as shown in Fig. 6(B), uses the odds ratio
of the baseline and heterozygote genotypes ($OR_1$) and the odds ratio of the baseline and risk
homozygote genotypes ($OR_2$) (see Appendix) [10].

In the absence of constraints, randomly sampled disease models could scatter within a
bounded plane in Fig. 6 (an outer bound for Fig. 6(A) could be: $-1 < \delta_p < 1$, $-1/2 < \delta_\epsilon < 1/2$),
whereas disease models in a given class are located in a more restricted subspace, such as a
line segment. We randomly sample dominant, recessive, and multiplicative models and use them
to generate datasets with 1000 case and 1000 control samples; these generated genotype count
tables are mapped to the 2-dimensional space in Fig. 6. In Fig. 6(A), multiplicative models are
located along the y-axis as this model does not lead to Hardy-Weinberg disequilibrium, and
recessive (dominant) models are located in regions with positive (negative) $\delta_\epsilon$ values [15] [16].
Similarly, in Fig. 6(B), dominant models are located along the line with slope 1 ($OR_2 = OR_1$),
multiplicative models are located on the line with slope 2 ($\log(OR_2)/\log(OR_1) = 2$), and
recessive models are on the vertical line ($OR_1 = 1$ and arbitrary $OR_2$).

If we sort different test statistics according to their corresponding degrees of freedom $k$ in
$\chi^2_k$ for the null distribution, the following order appears: the test on the 2-by-3 genotype count table
($k = 2$), MAX3 ($k \approx 1.57$), MAX2 ($k \approx 1.5$), and $CAT(x = 0.5)$ or $CAT(x = 0)$ or $CAT(x = 1)$
($k = 1$). On the projected disease model space in Fig. 6 there is also a gradual narrowing of
the models which these tests are designed to detect: the genotype count test targets any model in
the 2-dimensional space, MAX3 targets three types of models represented by 3 line segments,
MAX2 targets two types of models represented by 2 line segments, and $CAT(x)$ targets only
one line segment.

In Fig. 6 the line segments for the three types of disease models are somewhat blurred into
wider areas, but this is caused by the random realization of datasets rather than being a manifestation
of fractal geometry. However, the fractional df $k$ moves up from the integer value 1 to
1.5 and 1.57 when the number of line segments is increased. From this observation, we do
not believe that fractional $k$ is related to a fractional dimension of the underlying disease model
subspace.

Even without a geometric interpretation, we may propose another meaning for $k$: $k - 1$ can
be used to measure the level of uncertainty in the inferred disease model (mode of inheritance). For
$CAT(x)$, $k - 1 = 0$, and a significant test result also provides definite information concerning
the disease model. For the genotype test, $k - 1 = 1$, and a significant test does not tell us anything
about the disease model. A significant MAX2 test result provides some information on the disease
model (for example, that the true model is unlikely to be multiplicative), whereas MAX3 offers
even less information. If we consider the detection of an association signal and the inference of the disease
model as two independent tasks of a genetic association study, then these two components are
reminiscent of those studied in the uncertainty principle in quantum physics [42], such as
measuring the position and velocity of a particle at the same time.

In conclusion, the MAX test provides an interesting example where two non-integer quantities
can be defined and measured. The effective number of tests to be maximized is more
straightforward and has appeared in other applications as well. The fractional-$k$ $\chi^2_k$ distribution for a
test statistic is more intriguing, and seems to have a profound meaning concerning the test's
ability to infer specific information. We have shown that a linear function of a fractional-$k$
$\chi^2_k$ variable approximates the true distribution of MAX quite well. A hallmark of complex
systems is their intermediate state between two extremes (order and disorder): a similarly
intermediate state can also be described for fractional degrees of freedom in the MAX test which, in
the genetic analysis context, sits between testing genetic association under completely specified
and completely unknown disease models.

Appendix: Basic notations and results for case-control genetic tests

A case-control dataset consists of $N_1$ case samples and $N_0$ control samples whose genotype
(aa is the baseline homozygote, aA is the heterozygote, AA is the risk homozygote) is known.
The dataset can be represented by a 2-by-3 genotype count table:

                 aa          aA          AA          sample size
    case (1)     N_{10}      N_{11}      N_{12}      N_1 = N_{1*}
    control (0)  N_{00}      N_{01}      N_{02}      N_0 = N_{0*}
    combined     N_{*0}      N_{*1}      N_{*2}      N = N_1 + N_0

The above 2-by-3 genotype count table can be collapsed into several 2-by-2 tables. The
following collapsing corresponds to a dominant model (the risk allele A "dominates" allele
"a"):

                 aa          aA+AA
    case (1)     N_{10}      N_{11} + N_{12}
    control (0)  N_{00}      N_{01} + N_{02}

and the following collapsing corresponds to a recessive model (only two copies of the risk allele
A confer a disease risk):

                 aa+aA              AA
    case (1)     N_{10} + N_{11}    N_{12}
    control (0)  N_{00} + N_{01}    N_{02}

From a 2-by-2 table, Pearson's chi-square test statistic is of the form
$X^2 = \sum_{row,col} (O_{row,col} - E_{row,col})^2 / E_{row,col}$, where $O_{row,col}$ is the
observed (genotype) count in the table cell indexed by "row" and "column", and $E_{row,col}$ is
the expected count. The expected count is equal to the product of the row margin
$O_{row,*} = \sum_{col} O_{row,col}$ and the column margin $O_{*,col} = \sum_{row} O_{row,col}$,
divided by the total sample size $N$. It can be shown that $X^2$ is the product of the squared
matrix determinant and the total sample size, divided by the product of the 4 row and column
margins (e.g., [15]). For example, for the recessive model, $X^2_{REC}$ is:

$$X^2_{REC} = \frac{N D^2}{N_1 N_0 (N_{*0} + N_{*1}) N_{*2}}, \qquad D = (N_{10} + N_{11}) N_{02} - (N_{00} + N_{01}) N_{12}$$

Under the null hypothesis (by chance alone), $X^2_{REC}$ follows the $\chi^2_{k=1}$ distribution
(chi-square distribution with one degree of freedom). $X^2_{DOM}$ can be calculated similarly.
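
The determinant form of Pearson's statistic for the collapsed tables can be sketched as follows. The function names and the genotype counts are hypothetical and serve only as an illustration; the second function recomputes the statistic from the observed and expected counts so that the two forms can be compared.

```python
def pearson_2x2(a, b, c, d):
    """Pearson chi-square for the 2-by-2 table [[a, b], [c, d]] computed as
    N * determinant^2 / (product of the four margins)."""
    n = a + b + c + d
    det = a * d - b * c
    return n * det**2 / ((a + b) * (c + d) * (a + c) * (b + d))

def pearson_2x2_direct(a, b, c, d):
    """Same statistic from the definition: sum of (O - E)^2 / E over cells."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    obs = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            total += (obs[i][j] - expected) ** 2 / expected
    return total

def collapse(table, model):
    """Collapse a 2-by-3 table ((N10, N11, N12), (N00, N01, N02)) into the
    four cells (a, b, c, d) of a 2-by-2 table; rows are cases, then controls."""
    (n10, n11, n12), (n00, n01, n02) = table
    if model == "dominant":
        return n10, n11 + n12, n00, n01 + n02
    if model == "recessive":
        return n10 + n11, n12, n00 + n01, n02
    raise ValueError("model must be 'dominant' or 'recessive'")

# Made-up genotype counts, for illustration only.
table = ((620, 330, 50), (700, 270, 30))
x2_rec = pearson_2x2(*collapse(table, "recessive"))
x2_dom = pearson_2x2(*collapse(table, "dominant"))
print(x2_rec, x2_dom)
```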

The Cochran-Armitage trend (CAT) test is defined after each genotype is assigned a score.
Most assignments of the genotype scores are equivalent to a score of $\{x_j\} = (0, x, 1)$, i.e.,
the score for the baseline homozygote is fixed at 0, that for the risk homozygote is fixed at
1, and that for the heterozygote is a parameter $x$. The CAT test statistic at $x$ is defined as

([131 Ellin]: 

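$$CAT(x) = \frac{N \left[ \sum_{j=0}^{2} x_j (N_0 N_{1j} - N_1 N_{0j}) \right]^2}{N_1 N_0 \left[ N \sum_{j=0}^{2} x_j^2 N_{*j} - \left( \sum_{j=0}^{2} x_j N_{*j} \right)^2 \right]}$$

It can be shown that CAT($x = 0$) is equal to $X^2_{REC}$ and CAT($x = 1$) is equal to $X^2_{DOM}$.
Under the null hypothesis, CAT($x$) at each fixed $x$ value follows the $\chi^2_{k=1}$ distribution.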
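
The following sketch evaluates the trend statistic in its standard score form for a 2-by-3 table and illustrates numerically that the scores $(0, 0, 1)$ and $(0, 1, 1)$ reproduce the Pearson statistics of the recessive and dominant collapsed tables. The function names and counts are hypothetical. The last two functions assume the usual definitions MAX2 = max(CAT(0), CAT(1)) and MAX3 = max(CAT(0), CAT(0.5), CAT(1)), which this appendix does not restate.

```python
def cat(table, x):
    """Cochran-Armitage trend statistic with genotype scores (0, x, 1).
    table = ((N10, N11, N12), (N00, N01, N02)); rows are cases, then controls."""
    case, ctrl = table
    scores = (0.0, x, 1.0)
    n1, n0 = sum(case), sum(ctrl)
    n = n1 + n0
    col = [case[j] + ctrl[j] for j in range(3)]
    # Numerator: score-weighted contrast between case and control counts.
    u = sum(s * (n0 * case[j] - n1 * ctrl[j]) for j, s in enumerate(scores))
    # Denominator term: N * sum(x_j^2 N_*j) - (sum(x_j N_*j))^2.
    spread = (n * sum(s * s * col[j] for j, s in enumerate(scores))
              - sum(s * col[j] for j, s in enumerate(scores)) ** 2)
    return n * u**2 / (n1 * n0 * spread)

def max2(table):
    """Assumed definition: maximum of the recessive and dominant trend tests."""
    return max(cat(table, 0.0), cat(table, 1.0))

def max3(table):
    """Assumed definition: maximum over x = 0, 0.5 and 1."""
    return max(cat(table, 0.0), cat(table, 0.5), cat(table, 1.0))

# Made-up genotype counts, for illustration only.
table = ((620, 330, 50), (700, 270, 30))
print(cat(table, 0.0), cat(table, 0.5), cat(table, 1.0), max2(table), max3(table))
```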

A disease model of a bi-allelic disease locus can be specified by 4 parameters. One is the
allele frequency ($p = p_A$) and the other three characterize the susceptibility to the disease
under each genotype: $(f_0, f_1, f_2) = (P(disease|aa), P(disease|aA), P(disease|AA))$. The
latter three parameters can be replaced by the following three parameters: the relative genotype
risk for the heterozygote, $\lambda_1 = f_1/f_0$, that for the risk homozygote, $\lambda_2 = f_2/f_0$,
and the disease prevalence $K = f_0[(1 - p)^2 + \lambda_1 2p(1 - p) + \lambda_2 p^2]$. Either a
$(p_A, f_0, f_1, f_2)$ value or a $(p_A, \lambda_1, \lambda_2, K)$ value uniquely determines a
disease model.
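
The equivalence of the two parameterizations can be sketched as a pair of conversion functions. The function names are hypothetical; the code assumes, as the prevalence formula above does, that the population genotype frequencies are in Hardy-Weinberg equilibrium.

```python
def prevalence(p, f0, f1, f2):
    """Disease prevalence K from the allele frequency and the three penetrances."""
    return f0 * (1 - p) ** 2 + f1 * 2 * p * (1 - p) + f2 * p**2

def penetrances(p, lam1, lam2, K):
    """Recover (f0, f1, f2) from (p, lambda_1, lambda_2, K)."""
    f0 = K / ((1 - p) ** 2 + lam1 * 2 * p * (1 - p) + lam2 * p**2)
    f = (f0, lam1 * f0, lam2 * f0)
    if any(fi > 1 for fi in f):
        raise ValueError("parameters imply a penetrance above 1")
    return f

# Dominant model of Figure 1: p = 0.1, f0 = 0.005, lambda_1 = lambda_2 = 2.
K = prevalence(0.1, 0.005, 0.01, 0.01)
print(K, penetrances(0.1, 2.0, 2.0, K))
```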

There are several ideas for reducing the number of parameters of a disease model from 4 to
2 "major" parameters. One suggestion [15] is to use the allele frequency difference between the
case and control groups, $\delta_p = p_A(case) - p_A(control) = p_1 - p_0$, and the difference of
the Hardy-Weinberg disequilibrium coefficient between the two groups,
$\delta_\epsilon = \epsilon(case) - \epsilon(control) = \epsilon_1 - \epsilon_0$. The Hardy-Weinberg
disequilibrium coefficient $\epsilon$ measures the deviation from Hardy-Weinberg equilibrium [3],
such that the three genotype frequencies can be written as
$((1 - p)^2 + \epsilon, 2p(1 - p) - 2\epsilon, p^2 + \epsilon)$. The motivation for this
parameterization is that $\delta_p$ is directly related to the case-control association signal,
and $\delta_\epsilon$ is strongly correlated with the disease model.

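The group-specific allele frequency and Hardy-Weinberg disequilibrium coefficient can be
determined from the 4 parameters $p$, $\lambda_1$, $\lambda_2$, $K$ [41], and their differences
can be determined as well:

$$\delta_p = p_1 - p_0 = \frac{f_0 (p^2 \lambda_2 + p(1-p) \lambda_1)}{K} - \frac{p^2 (1 - f_0 \lambda_2) + p(1-p)(1 - f_0 \lambda_1)}{1 - K}$$

$$\delta_\epsilon = \epsilon_1 - \epsilon_0 = \frac{f_0^2 p^2 (1-p)^2 (\lambda_2 - \lambda_1^2)}{K^2} - \frac{f_0 p^2 (1-p)^2 (2\lambda_1 - 1 - \lambda_2 - f_0 \lambda_1^2 + f_0 \lambda_2)}{(1-K)^2}$$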
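
As a numerical check, the closed-form differences can be compared with values computed directly from the case and control genotype frequencies. This is a sketch under the assumption of Hardy-Weinberg equilibrium in the general population; the function names are hypothetical.

```python
def group_genotype_freqs(p, f0, lam1, lam2):
    """Genotype frequencies (aa, aA, AA) among cases and among controls."""
    pop = ((1 - p) ** 2, 2 * p * (1 - p), p**2)
    f = (f0, lam1 * f0, lam2 * f0)
    K = sum(fi * gi for fi, gi in zip(f, pop))
    case = tuple(fi * gi / K for fi, gi in zip(f, pop))
    ctrl = tuple((1 - fi) * gi / (1 - K) for fi, gi in zip(f, pop))
    return case, ctrl

def allele_freq_and_hwd(freqs):
    """Risk allele frequency and Hardy-Weinberg disequilibrium coefficient."""
    aa, aA, AA = freqs
    pA = AA + aA / 2
    return pA, AA - pA**2

def deltas_direct(p, f0, lam1, lam2):
    case, ctrl = group_genotype_freqs(p, f0, lam1, lam2)
    p1, e1 = allele_freq_and_hwd(case)
    p0, e0 = allele_freq_and_hwd(ctrl)
    return p1 - p0, e1 - e0

def deltas_closed_form(p, f0, lam1, lam2):
    q = 1 - p
    K = f0 * (q**2 + lam1 * 2 * p * q + lam2 * p**2)
    dp = (f0 * (p**2 * lam2 + p * q * lam1) / K
          - (p**2 * (1 - f0 * lam2) + p * q * (1 - f0 * lam1)) / (1 - K))
    de = (f0**2 * p**2 * q**2 * (lam2 - lam1**2) / K**2
          - f0 * p**2 * q**2
          * (2 * lam1 - 1 - lam2 - f0 * lam1**2 + f0 * lam2) / (1 - K) ** 2)
    return dp, de

# Dominant model of Figure 1: p = 0.1, f0 = 0.005, lambda_1 = lambda_2 = 2.
print(deltas_direct(0.1, 0.005, 2.0, 2.0))
print(deltas_closed_form(0.1, 0.005, 2.0, 2.0))
```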

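Given a 2-by-3 genotype table, these two parameters can be estimated by:

$$\hat{\delta}_p = \hat{p}_1 - \hat{p}_0 = \frac{N_{12} + N_{11}/2}{N_1} - \frac{N_{02} + N_{01}/2}{N_0}$$

$$\hat{\delta}_\epsilon = \hat{\epsilon}_1 - \hat{\epsilon}_0 = \frac{N_{12}}{N_1} - \left( \frac{N_{12} + N_{11}/2}{N_1} \right)^2 - \frac{N_{02}}{N_0} + \left( \frac{N_{02} + N_{01}/2}{N_0} \right)^2$$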
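
These two estimators translate directly into a few lines of code. The sketch below uses a hypothetical function name and made-up counts.

```python
def estimate_deltas(table):
    """Estimate (delta_p, delta_epsilon) from a 2-by-3 genotype count table
    ((N10, N11, N12), (N00, N01, N02)); rows are cases, then controls."""
    (n10, n11, n12), (n00, n01, n02) = table
    n1 = n10 + n11 + n12
    n0 = n00 + n01 + n02
    p1 = (n12 + n11 / 2) / n1   # risk allele frequency in cases
    p0 = (n02 + n01 / 2) / n0   # risk allele frequency in controls
    e1 = n12 / n1 - p1**2       # Hardy-Weinberg disequilibrium in cases
    e0 = n02 / n0 - p0**2       # Hardy-Weinberg disequilibrium in controls
    return p1 - p0, e1 - e0

# Made-up genotype counts, for illustration only.
table = ((620, 330, 50), (700, 270, 30))
print(estimate_deltas(table))  # approximately (0.05, 0.001)
```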

Another idea for selecting two major parameters in the disease model is to ignore $p$ and $K$,
and focus only on $\lambda_1$ and $\lambda_2$. These two parameters can be estimated by the two
odds ratios from subtables consisting of the baseline column and one risk column:

$$\hat{\lambda}_1 = OR_1 = \frac{N_{11} N_{00}}{N_{10} N_{01}}, \qquad \hat{\lambda}_2 = OR_2 = \frac{N_{12} N_{00}}{N_{10} N_{02}}$$
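
A minimal sketch of the two odds ratios follows, with a hypothetical function name and made-up counts. The code makes no correction for empty cells, so a zero count in a denominator cell raises an error.

```python
def odds_ratios(table):
    """Return (OR1, OR2) from a 2-by-3 genotype count table
    ((N10, N11, N12), (N00, N01, N02)); rows are cases, then controls.
    OR1 compares aA with aa, and OR2 compares AA with aa."""
    (n10, n11, n12), (n00, n01, n02) = table
    or1 = (n11 * n00) / (n10 * n01)
    or2 = (n12 * n00) / (n10 * n02)
    return or1, or2

# Made-up genotype counts, for illustration only.
table = ((620, 330, 50), (700, 270, 30))
print(odds_ratios(table))
```

In the $OR_1$-$OR_2$ space, the three model classes simulated for Figure 6 correspond to $OR_1 \approx OR_2$ (dominant), $OR_1 \approx \sqrt{OR_2}$ (multiplicative), and $OR_1 \approx 1$ (recessive).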

Fig. 6(A) and (B) illustrate these two ideas of a two-dimensional disease model space.

Figure 1: The distribution of CAT($x = 0$), CAT($x = 0.5$) and CAT($x = 1$) from 100,000 replicates
generated by a dominant model: the population risk allele frequency is $p = 0.1$, the penetrance for
the baseline homozygote is 0.005, and the genotype relative risk for both the heterozygote and the
risk homozygote is $\lambda_1 = \lambda_2 = 2$. The genotype frequencies for the case and the control
group are calculated by the formula given in [41]. (Plot title: distribution of CAT simulated with a
dominant model.)

Figure 2: Fractional number of tests for MAX2 and MAX3 ($d_{MAX2}$, $d_{MAX3}$) determined by Eq. (3)
with three simulation runs, as a function of the tail area probability under $\chi^2_1$
($P_{\chi^2_1}$). Each run contains 100,000 replicates of genotype count tables for 1000 cases and
1000 controls (3000 cases/3000 controls and 5000 cases/5000 controls for the second and the third
run). As a comparison, the effective numbers of tests for CAT($x = 0$) and for CAT($x = 1$) as
determined by simulation are also included. As expected, these effective numbers of tests are
essentially equal to 1.

Figure 3: Quantile-quantile (QQ) plot of MAX2/MAX3 against values sampled from $\chi^2_k$ with
fractional degrees of freedom $k$. (A) QQ plot of MAX2 against values sampled from $\chi^2_k$'s with
$k$ = 1, 1.3, 1.4, 1.45, 1.5, 1.55, 1.6. The circles indicate the QQ plot between two identical
distributions. (B) QQ plot of MAX3 against values sampled from $\chi^2_k$'s with $k$ = 1, 1.3, 1.5,
1.6, 1.7, 1.8, 1.9, 2. (C) Detrended QQ plot of MAX2 against values sampled from $\chi^2_k$'s.
(D) Detrended QQ plot of MAX3 against values sampled from $\chi^2_k$'s. The crosses represent the
detrended QQ plot for $0.45 + 0.96 \chi^2_{k=1.7}$ against MAX3.

Figure 4: Probability density distribution of $\chi^2_1$ (labeled by 1), $\chi^2_2$ (labeled by 2),
$\chi^2_{k=1.51}$ (dashed line), and simulated MAX2 (solid line). The threshold values $M$ that
correspond to a tail area of 0.05 for these distributions are also marked. The $\chi^2_{k=1.51}$
distribution has the same $M$ as the MAX2 distribution, so the two are equivalent at the tail area
of 0.05. (Axis label: test statistic, log scale.)

Figure 6: Simulation of 100 case-control datasets each for three classes of models (dominant,
multiplicative, recessive). Each point represents a genotype count table for 1000 cases and 1000
controls. The allele frequency is randomly chosen from (0.1-0.9); the disease prevalence is sampled
from the normal distribution with a random mean and a standard deviation of 1/10 of the mean; the
genotype relative risk $\lambda_2$ is randomly chosen between (1.1-10); $\lambda_1$ is equal to
$\lambda_2$, $\sqrt{\lambda_2}$, and 1 for dominant, multiplicative, and recessive models. (A) The
location of simulated datasets in the $\delta_\epsilon$-$\delta_p$ parameter space, where
$\delta_\epsilon$ is the case-control difference of Hardy-Weinberg disequilibrium coefficients and
$\delta_p$ is the case-control difference of allele frequencies. The symbols "d", "m", "r" represent
dominant, multiplicative, and recessive models, respectively. (B) The location of the same simulated
case-control datasets in the $OR_1$-$OR_2$ space (both x- and y-axes are in log scale), where $OR_1$
is the odds ratio of the heterozygote genotype vs. the baseline homozygote genotype, and $OR_2$ is
the odds ratio of the risk homozygote genotype vs. the baseline homozygote genotype.



